Claims management requires different levels of checks before a claim can be approved and a payment made. When unexpected events occur, customers expect their insurer to provide support as soon as possible. To meet these growing customer expectations, BNP Paribas Cardif launched a Kaggle challenge to accelerate its claims process.
In the challenge, BNP Paribas Cardif provided an anonymized dataset with two categories of claims:
claims for which approval could be accelerated, leading to faster payments
claims for which additional information is required before approval.
The task is to predict the category of a claim based on features available early in the process.
The main motivations for the project are, firstly, to explore and gain insights from the dataset, and secondly, to apply machine learning models to predict the outcome.
The evaluation metric used to assess model accuracy is Log Loss defined as
\[\text{logLoss} = -\frac{1}{N}\sum_{i=1}^{N} \left( y_{i}\log p_{i} + (1-y_{i})\log(1-p_{i}) \right)\]
where N is the number of observations, log is the natural logarithm, \(y_{i}\) is the binary target, and \(p_{i}\) is the predicted probability that \(y_{i}\) equals 1.
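As a sanity check on the metric, log loss can be computed directly in R. This is a sketch; the `log_loss` helper name is ours, not part of the competition, and the small probability clamp is a common safeguard rather than something specified by the definition above.

```r
# Log loss for a binary target, following the definition above.
# Probabilities are clamped away from 0 and 1 to avoid infinite loss.
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

log_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # ~0.145
log_loss(c(1, 0, 1), c(0.5, 0.5, 0.5))  # log(2) ~ 0.693
```

A lower value is better; an uninformative constant prediction of 0.5 scores log(2) regardless of the targets.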
BNP provided two datasets: a training dataset and a test dataset for submission. We only used the training dataset in this project, as the test dataset is reserved for the competition. The dataset contains both categorical and numeric variables available when the claims were received by BNP Paribas Cardif. All string-type variables are categorical; there are no ordinal variables. The “target” variable in the training dataset is the variable to predict and equals 1 for claims suitable for accelerated approval.
Read the dataset
library(vcd)
library(readr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(dplyr)
library(tidyr)
library(mice)
library(VIM)
library(caret)
theme_set(theme_bw(base_size = 20))
## Download the csv.gz file from https://github.com/ChoonC/CS107_Project
filename <- "train.csv.gz"
train <- read_csv(gzfile(filename))
#convert character variables to factor
train[sapply(train, is.character)] <- lapply(train[sapply(train, is.character)], as.factor)
There are 114,321 observations and 133 variables in the dataset. Apart from the variables “ID” and “target”, all other variables/features (v1-v131) are anonymized, meaning we don’t know what the values represent. There are no duplicate IDs in the dataset; each observation is a unique claim.
Typically with personal insurance claims data one would expect information such as policy number, customer ID, contract date, claims date, events date, product, subproduct, age, processed date, etc. As the data is anonymized, it is unknown what has been provided or what process was used to anonymize the data.
dim(train)
## [1] 114321 133
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 114321 obs. of 133 variables:
## $ ID : int 3 4 5 6 8 9 12 21 22 23 ...
## $ target: int 1 1 1 1 1 0 0 1 0 1 ...
## $ v1 : num 1.336 NA 0.944 0.797 NA ...
## $ v2 : num 8.73 NA 5.31 8.3 NA ...
## $ v3 : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 NA 3 ...
## $ v4 : num 3.92 NA 4.41 4.23 NA ...
## $ v5 : num 7.92 9.19 5.33 11.63 NA ...
## $ v6 : num 2.6 NA 3.98 2.1 NA ...
## $ v7 : num 3.18 NA 3.93 1.99 NA ...
## $ v8 : num 0.0129 2.3016 0.0196 0.1719 NA ...
## $ v9 : num 10 NA 12.67 8.97 NA ...
## $ v10 : num 0.503 1.313 0.766 6.543 1.05 ...
## $ v11 : num 16.4 NA 14.8 16.3 NA ...
## $ v12 : num 6.09 6.51 6.38 9.65 6.32 ...
## $ v13 : num 2.87 NA 2.51 3.9 NA ...
## $ v14 : num 11.6 11.6 9.6 14.1 11 ...
## $ v15 : num 1.36 NA 1.98 1.95 NA ...
## $ v16 : num 8.57 NA 5.88 5.52 NA ...
## $ v17 : num 3.67 NA 3.17 3.61 NA ...
## $ v18 : num 0.107 NA 0.245 1.224 NA ...
## $ v19 : num 0.149 NA 0.144 0.232 NA ...
## $ v20 : num 18.9 NA 18 18.4 NA ...
## $ v21 : num 7.73 6.76 5.25 7.52 6.41 ...
## $ v22 : Factor w/ 18210 levels "AA","AAA","AAAA",..: 16671 7734 7087 1511 8038 4580 11283 3862 7522 12770 ...
## $ v23 : num -1.72e-08 NA -2.79e-07 -4.81e-07 NA ...
## $ v24 : Factor w/ 5 levels "A","B","C","D",..: 3 3 5 4 5 1 5 4 5 3 ...
## $ v25 : num 0.139 3.056 0.114 0.149 NA ...
## $ v26 : num 1.72 NA 2.24 1.31 NA ...
## $ v27 : num 3.39 NA 5.31 2.3 NA ...
## $ v28 : num 0.59 NA 0.836 8.927 NA ...
## $ v29 : num 8.88 NA 7.5 8.87 NA ...
## $ v30 : Factor w/ 7 levels "A","B","C","D",..: 3 3 NA 3 NA NA 7 3 NA 3 ...
## $ v31 : Factor w/ 3 levels "A","B","C": 1 1 1 2 1 1 1 1 NA 1 ...
## $ v32 : num 1.08 NA 1.45 1.59 NA ...
## $ v33 : num 1.01 NA 1.73 1.67 NA ...
## $ v34 : num 7.27 3.62 4.04 8.7 6.08 ...
## $ v35 : num 8.38 NA 7.96 8.9 NA ...
## $ v36 : num 11.3 14.6 12.7 11.3 NA ...
## $ v37 : num 0.455 NA 0.26 0.434 NA ...
## $ v38 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ v39 : num 4.012 NA 7.379 0.287 NA ...
## $ v40 : num 7.71 14.31 13.08 11.52 10.14 ...
## $ v41 : num 7.65 NA 6.17 7.93 NA ...
## $ v42 : num 12.7 NA 12.3 12.9 NA ...
## $ v43 : num 2.02 NA 2.93 1.47 NA ...
## $ v44 : num 10.5 NA 8.9 12.7 NA ...
## $ v45 : num 9.85 NA 5.34 9.67 NA ...
## $ v46 : num 0.114 2.45 0.126 0.108 NA ...
## $ v47 : Factor w/ 10 levels "A","B","C","D",..: 3 5 3 3 9 9 3 9 4 9 ...
## $ v48 : num 12.2 NA 12.7 12.2 NA ...
## $ v49 : num 8.09 NA 6.84 8.59 NA ...
## $ v50 : num 0.899 1.379 0.605 3.329 1.365 ...
## $ v51 : num 7.28 NA 9.64 4.78 NA ...
## $ v52 : Factor w/ 12 levels "A","B","C","D",..: 7 7 6 8 8 11 1 3 8 1 ...
## $ v53 : num 16.7 NA 15.1 16.6 NA ...
## $ v54 : num 0.0371 1.1295 0.0856 0.1397 NA ...
## $ v55 : num 1.3 NA 0.765 1.178 NA ...
## $ v56 : Factor w/ 122 levels "A","AA","AB",..: 86 102 18 48 NA 101 18 93 6 NA ...
## $ v57 : num 3.97 NA 4.03 3.97 NA ...
## $ v58 : num 0.53 NA 4.28 1.73 NA ...
## $ v59 : num 10.89 NA 9.11 11.78 NA ...
## $ v60 : num 1.59 NA 2.15 1.23 NA ...
## $ v61 : num 15.9 NA 16.1 15.9 NA ...
## $ v62 : int 1 2 1 1 1 1 2 2 0 3 ...
## $ v63 : num 0.153 2.545 0.124 0.14 NA ...
## $ v64 : num 6.36 NA 5.52 6.29 NA ...
## $ v65 : num 18.3 NA 16.4 17 NA ...
## $ v66 : Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 3 2 1 ...
## $ v67 : num 9.31 NA 8.37 9.7 NA ...
## $ v68 : num 15.2 NA 11 18.6 NA ...
## $ v69 : num 17.14 NA 5.88 9.43 NA ...
## $ v70 : num 11.78 12.05 8.46 13.59 NA ...
## $ v71 : Factor w/ 9 levels "A","B","C","D",..: 5 5 2 5 5 5 5 5 2 5 ...
## $ v72 : int 1 2 3 2 1 1 2 2 0 3 ...
## $ v73 : num 1.61 NA 2.41 2.27 NA ...
## $ v74 : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
## $ v75 : Factor w/ 4 levels "A","B","C","D": 4 4 2 4 4 4 4 4 2 4 ...
## $ v76 : num 2.23 NA 1.96 2.19 NA ...
## $ v77 : num 7.29 NA 5.92 8.21 NA ...
## $ v78 : num 8.57 NA 11.76 13.45 NA ...
## $ v79 : Factor w/ 18 levels "A","B","C","D",..: 5 4 5 2 3 9 5 3 16 3 ...
## $ v80 : num 3 NA 3.33 1.95 NA ...
## $ v81 : num 7.53 7.28 10.19 4.8 NA ...
## $ v82 : num 8.86 3.43 8.27 13.32 NA ...
## $ v83 : num 0.65 NA 1.53 1.68 NA ...
## $ v84 : num 1.3 NA 1.53 1.38 NA ...
## $ v85 : num 1.71 NA 2.43 1.59 NA ...
## $ v86 : num 0.866 NA 1.071 1.243 NA ...
## $ v87 : num 9.55 9.85 8.45 10.75 NA ...
## $ v88 : num 3.32 NA 3.37 1.41 NA ...
## $ v89 : num 0.0957 2.6786 0.1114 0.0391 NA ...
## $ v90 : num 0.905 NA 0.811 1.042 NA ...
## $ v91 : Factor w/ 7 levels "A","B","C","D",..: 1 2 7 2 7 7 2 1 3 1 ...
## $ v92 : num 0.442 NA 0.271 0.764 NA ...
## $ v93 : num 5.81 NA 5.16 5.5 NA ...
## $ v94 : num 3.52 NA 4.21 3.42 NA ...
## $ v95 : num 0.462 NA 0.31 0.833 NA ...
## $ v96 : num 7.44 NA 5.66 7.38 NA ...
## $ v97 : num 5.45 NA 5.97 6.75 NA ...
## [list output truncated]
n_obs <- n_distinct(train$ID)
n_obs
## [1] 114321
table(train$target)
##
## 0 1
## 27300 87021
prop.table(table(train$target))
##
## 0 1
## 0.2388013 0.7611987
#count number of numeric features
sum(sapply(train, is.numeric))
## [1] 114
#count number of categorical features
sum(sapply(train, is.factor))
## [1] 19
#Create tables by data types for exploration, keep ID and target on both for sorting
train_num <- train[, sapply(train, is.numeric)] %>%
  arrange(-target, ID)
train_fac <- bind_cols(select(train, c(ID, target)),
                       train[, sapply(train, is.factor)]) %>%
  arrange(-target, ID)
87,021 or 76.1% of the claims were classified as suitable for accelerated approval. There are 112 numeric features in addition to the ID and target variables, and 19 categorical features.
Four of the numeric features contain discrete values (v38, v62, v72, v129), which might suggest some kind of count. All other numeric variables are continuous.
From the summary, we can see that a large number of the numeric features have missing values, and most have values in the range 0-20. (v14 has 40 values on the order of -0.000001, which we regard as 0.)
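The discrete-valued claim can be checked directly. This is a sketch that assumes the `train_num` table built above; it flags numeric columns whose non-missing values are all whole numbers (ID and target will also be flagged, since they are integers).

```r
# Sketch: flag numeric features whose observed values are all integers.
# Assumes `train_num` from above; expect v38, v62, v72, v129 (plus ID/target).
discrete <- sapply(train_num, function(x) all(x == round(x), na.rm = TRUE))
names(train_num)[discrete]
```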
#summarise the features
sapply(select(train_num, -ID),summary)
## $target
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.7612 1.0000 1.0000
##
## $v1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.91 1.47 1.63 2.14 20.00 49832
##
## $v2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.32 7.02 7.46 9.46 20.00 49796
##
## $v4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.49 4.21 4.14 4.83 20.00 49796
##
## $v5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.61 8.67 8.74 9.77 20.00 48624
##
## $v6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.06 2.41 2.44 2.77 20.00 49832
##
## $v7
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.10 2.45 2.48 2.83 20.00 49832
##
## $v8
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.09 0.39 1.50 1.62 20.00 48619
##
## $v9
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.85 9.06 9.03 10.23 20.00 49851
##
## $v10
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.050 1.313 1.883 2.101 18.530 84
##
## $v11
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 15.00 15.50 15.45 15.95 20.00 49836
##
## $v12
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.322 6.613 6.881 7.020 18.710 86
##
## $v13
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.07 3.59 3.80 4.29 20.00 49832
##
## $v14
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.000001 11.260000 11.970000 12.090000 12.720000 20.000000 4
##
## $v15
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.61 1.99 2.08 2.42 20.00 49836
##
## $v16
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.87 4.93 4.92 5.96 20.00 49895
##
## $v17
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.70 3.55 3.83 4.51 20.00 49796
##
## $v18
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.51 0.77 0.84 1.07 20.00 49832
##
## $v19
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.17 0.20 0.22 0.24 20.00 49843
##
## $v20
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.52 17.33 18.04 17.77 18.54 20.00 49840
##
## $v21
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1062 6.4160 7.0450 7.0300 7.6710 19.3000 611
##
## $v23
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 1.09 0.00 20.00 50675
##
## $v25
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.04 0.15 0.47 1.70 1.95 20.00 48619
##
## $v26
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.51 1.83 1.88 2.18 20.00 49832
##
## $v27
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.20 2.67 2.74 3.22 20.00 49832
##
## $v28
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.49 5.04 5.09 6.58 19.85 49832
##
## $v29
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.43 8.30 8.21 9.09 20.00 49832
##
## $v32
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.26 1.56 1.62 1.90 17.56 49832
##
## $v33
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.47 1.95 2.16 2.63 20.00 49832
##
## $v34
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.054 6.537 6.406 7.703 20.000 111
##
## $v35
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.25 8.07 8.12 8.94 20.00 49832
##
## $v36
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 11.77 13.77 13.38 15.32 20.00 48624
##
## $v37
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.40 0.64 0.74 0.95 20.00 49843
##
## $v38
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09093 0.00000 12.00000
##
## $v39
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.13 0.38 1.24 1.19 19.92 49836
##
## $v40
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 8.408 10.330 10.470 12.770 20.000 111
##
## $v41
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.54 7.20 7.18 7.83 20.00 49832
##
## $v42
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 12.34 12.93 12.92 13.49 20.00 49832
##
## $v43
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.79 2.15 2.22 2.56 20.00 49836
##
## $v44
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.57 10.78 10.80 12.02 19.83 49796
##
## $v45
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.83 9.16 9.14 10.42 20.00 49832
##
## $v46
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.07 0.12 0.44 1.63 1.82 20.00 48619
##
## $v48
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 11.22 12.41 12.54 13.78 20.00 49796
##
## $v49
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.47 8.02 8.02 8.56 20.00 49832
##
## $v50
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.6588 1.2120 1.5040 2.0070 20.0000 86
##
## $v51
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.60 7.13 7.20 8.64 20.00 50678
##
## $v53
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 15.28 15.77 15.71 16.22 20.00 49836
##
## $v54
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.01 0.09 0.31 1.25 1.41 20.00 48619
##
## $v55
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.98 1.37 1.56 1.94 20.00 49832
##
## $v57
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.65 4.07 4.08 4.49 20.00 49832
##
## $v58
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.50 5.33 7.70 13.96 20.00 49836
##
## $v59
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.06 10.54 10.59 12.03 20.00 49796
##
## $v60
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.36 1.67 1.71 2.01 20.00 49832
##
## $v61
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 13.60 15.08 14.58 16.11 18.85 49796
##
## $v62
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.031 1.000 7.000
##
## $v63
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.05 0.14 0.46 1.69 1.85 20.00 48619
##
## $v64
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.79 6.11 6.34 7.52 20.00 49796
##
## $v65
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.66 15.03 16.26 15.85 17.16 20.00 49840
##
## $v67
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 8.58 9.31 9.29 9.99 20.00 49832
##
## $v68
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.50 17.08 18.27 17.56 18.91 20.00 49836
##
## $v69
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 8.39 9.52 9.45 10.54 20.00 49895
##
## $v70
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.43 10.80 12.49 12.27 13.99 19.82 48636
##
## $v72
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.432 2.000 12.000
##
## $v73
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.90 2.33 2.43 2.85 20.00 49836
##
## $v76
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.63 2.17 2.40 2.81 20.00 49796
##
## $v77
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.50 7.38 7.31 8.16 15.97 49832
##
## $v78
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 12.31 13.33 13.33 14.39 20.00 49895
##
## $v80
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.45 2.09 2.21 2.86 20.00 49851
##
## $v81
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.98 7.52 7.29 8.78 20.00 48624
##
## $v82
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.38 3.69 6.21 8.79 20.00 48624
##
## $v83
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.43 1.94 2.17 2.67 20.00 49832
##
## $v84
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.96 1.42 1.61 2.07 20.00 49832
##
## $v85
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.84 2.60 2.82 3.57 20.00 50682
##
## $v86
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.94 1.16 1.22 1.42 17.56 49832
##
## $v87
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.87 8.05 10.00 10.18 12.23 19.84 48663
##
## $v88
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.18 1.76 1.92 2.46 20.00 49832
##
## $v89
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.02 0.10 0.33 1.52 1.75 20.00 48619
##
## $v90
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.86 0.97 0.97 1.06 6.31 49836
##
## $v92
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.44 0.54 0.58 0.68 8.92 49843
##
## $v93
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.55 5.30 5.47 6.22 20.00 49832
##
## $v94
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.33 3.74 3.85 4.22 19.02 49832
##
## $v95
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.50 0.62 0.67 0.77 9.07 49843
##
## $v96
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.76 6.51 6.46 7.23 20.00 49832
##
## $v97
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.32 7.45 7.62 8.78 20.00 49843
##
## $v98
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.12 7.64 7.67 9.06 19.06 48654
##
## $v99
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.93 1.24 1.25 1.55 20.00 49832
##
## $v100
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.75 14.48 12.09 18.32 20.00 49836
##
## $v101
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.26 6.62 6.87 8.24 20.00 49796
##
## $v102
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.79 2.46 2.89 3.41 20.00 51316
##
## $v103
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.50 5.13 5.30 5.88 18.78 49832
##
## $v104
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.14 2.51 2.64 2.95 20.00 49832
##
## $v105
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.06 0.24 1.08 1.02 20.00 48658
##
## $v106
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 10.05 12.09 11.79 13.77 20.00 49796
##
## $v108
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.53 1.98 2.15 2.54 20.00 48624
##
## $v109
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.84 3.09 4.18 5.15 20.00 48624
##
## $v111
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.32 3.11 3.37 4.12 20.00 49832
##
## $v114
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 12.00 14.04 13.57 15.37 20.00 30
##
## $v115
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.46 10.48 10.55 11.61 20.00 49895
##
## $v116
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.85 2.22 2.29 2.65 20.00 49836
##
## $v117
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.76 8.07 8.30 10.50 20.00 48624
##
## $v118
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.98 8.14 8.37 9.56 20.00 49843
##
## $v119
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.32 1.46 3.17 4.17 20.00 50680
##
## $v120
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.78 1.15 1.29 1.65 10.39 49836
##
## $v121
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.79 2.44 2.74 3.38 20.00 49840
##
## $v122
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.65 6.75 6.82 7.91 20.00 49851
##
## $v123
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.02 1.96 2.74 3.55 4.08 19.69 50678
##
## $v124
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.02 0.14 0.92 0.87 20.00 48619
##
## $v126
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.42 1.62 1.67 1.84 15.63 49832
##
## $v127
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.10 2.96 3.24 4.11 20.00 49832
##
## $v128
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.39 1.80 2.03 2.39 20.00 48624
##
## $v129
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3101 0.0000 11.0000
##
## $v130
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.11 1.56 1.93 2.33 20.00 49843
##
## $v131
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.01 1.59 1.74 2.26 20.00 49895
# Look at the less-than-zero values in v14
select(train_num, v14) %>%
  filter(v14 < 0) %>%
  arrange(-v14)
## Source: local data frame [40 x 1]
##
## v14
## (dbl)
## 1 -2.234954e-08
## 2 -4.923125e-08
## 3 -6.485695e-08
## 4 -7.864030e-08
## 5 -8.883815e-08
## 6 -1.024361e-07
## 7 -1.381705e-07
## 8 -1.543122e-07
## 9 -1.642419e-07
## 10 -1.713699e-07
## .. ...
Only 4 of the numeric features have no missing values (v38, v62, v72, v129). The majority (100) of the features have a proportion of missing values between 42.5% and 44.5%. The remaining features have less than 0.53% missing values. The proportion of missing values is slightly greater in claims that were accelerated (target = 1) compared to those that required further information (target = 0).
# Function to return a table of counts and missing percentages for all features
getMissing <- function(tbl){
  data.frame(feature.names = colnames(tbl),
             n_Miss = apply(tbl, 2, function(i){ sum(is.na(i)) }),
             pct_Miss = apply(tbl, 2, function(i){ sum(is.na(i)) / length(i) })) %>%
    arrange(-n_Miss)
}
#Overall missingness
missing_num <- getMissing(select(train_num, -ID))
table(missing_num$n_Miss)
##
## 0 4 30 84 86 111 611 48619 48624 48636 48654 48658
## 5 1 1 1 2 2 1 7 8 1 1 1
## 48663 49796 49832 49836 49840 49843 49851 49895 50675 50678 50680 50682
## 1 11 34 12 3 7 3 5 1 2 1 1
## 51316
## 1
#Missingness for target = 1
missing_tgt1 <- getMissing(select(filter(train_num, target == 1), -ID))
names(missing_tgt1) <- c("feature.names", "n_Miss_tgt1", "pct_Miss_tgt1")
#Join the tables
missing_num <- missing_num %>%
inner_join(missing_tgt1)
## Joining by: "feature.names"
slice(missing_num, 1:10)
## feature.names n_Miss pct_Miss n_Miss_tgt1 pct_Miss_tgt1
## 1 v102 51316 0.4488764 39472 0.4535917
## 2 v85 50682 0.4433306 39037 0.4485929
## 3 v119 50680 0.4433131 39035 0.4485699
## 4 v51 50678 0.4432956 39033 0.4485469
## 5 v123 50678 0.4432956 39033 0.4485469
## 6 v23 50675 0.4432694 39031 0.4485239
## 7 v16 49895 0.4364465 38424 0.4415486
## 8 v69 49895 0.4364465 38424 0.4415486
## 9 v78 49895 0.4364465 38424 0.4415486
## 10 v115 49895 0.4364465 38424 0.4415486
rm(missing_tgt1)
For the features with a high proportion of missing values, the counts of observations with missing values take on only a few distinct values, and this also holds within the target = 1 group. This suggests that these groups of features tend to have missing values together and might be related to types of claims (product types).
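One way to make the "missing together" claim concrete is to collapse each row's NA mask into a pattern string and count how often each pattern occurs; a handful of dominant patterns supports the idea of grouped missingness. This is a sketch assuming the `train_num` table from above.

```r
# Sketch: count distinct missingness patterns across the numeric features.
# Assumes `train_num` from above. A few very frequent patterns would indicate
# that the same groups of features are missing together.
na_mask  <- is.na(select(train_num, -ID, -target))
patterns <- apply(na_mask, 1, function(r) paste(as.integer(r), collapse = ""))
head(sort(table(patterns), decreasing = TRUE), 5)
```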
We use the VIM package to get a better understanding of the missing data (ref: [Imputing missing data with R; MICE package, r-bloggers](http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/)).
62,561 (55%) of the observations are complete in the numeric values. 47,715 (42%) have missing values for the same group of features, confirming our suspicion that the same groups of features have missing values together. There appear to be some patterns and clusters of observations. The pattern of missing values does not appear to differ much between the target groups. The missing values are unlikely to be random; they are more likely related to the types of claims or products and the questions asked about the claims.
Due to the large proportion of features with missing values, excluding those observations or features would lead to a loss of information. We impute the missing values for features where the proportion of missing values is less than 5%, namely v10, v12, v14, v21, v34, v40, v50 and v114. We then add an indicator to the dataset to distinguish observations with no missing values from those with missing values.
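The imputation step described above can be sketched as follows. This is only an illustration, not the exact procedure: it flags rows with any missing numeric value first, then median-imputes the low-missingness features (the loaded mice package could be used instead; median imputation is a simpler stand-in shown here, and the `has_missing` column name is ours).

```r
# Sketch of the plan above, assuming `train_num` from earlier.
# 1. Flag observations that have any missing numeric value (before imputing).
# 2. Median-impute the features with < 5% missing values.
low_miss <- c("v10", "v12", "v14", "v21", "v34", "v40", "v50", "v114")
train_num <- train_num %>%
  mutate(has_missing = as.integer(!complete.cases(.)))
for (v in low_miss) {
  med <- median(train_num[[v]], na.rm = TRUE)
  train_num[[v]][is.na(train_num[[v]])] <- med
}
```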
#Visualise missing data
tmp <- select(train_num, -c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data"))
##
## Variables sorted by number of missings:
## Variable Count
## v102 51316
## v85 50682
## v119 50680
## v51 50678
## v123 50678
## v23 50675
## v16 49895
## v69 49895
## v78 49895
## v115 49895
## v131 49895
## v9 49851
## v80 49851
## v122 49851
## v19 49843
## v37 49843
## v92 49843
## v95 49843
## v97 49843
## v118 49843
## v130 49843
## v20 49840
## v65 49840
## v121 49840
## v11 49836
## v15 49836
## v39 49836
## v43 49836
## v53 49836
## v58 49836
## v68 49836
## v73 49836
## v90 49836
## v100 49836
## v116 49836
## v120 49836
## v1 49832
## v6 49832
## v7 49832
## v13 49832
## v18 49832
## v26 49832
## v27 49832
## v28 49832
## v29 49832
## v32 49832
## v33 49832
## v35 49832
## v41 49832
## v42 49832
## v45 49832
## v49 49832
## v55 49832
## v57 49832
## v60 49832
## v67 49832
## v77 49832
## v83 49832
## v84 49832
## v86 49832
## v88 49832
## v93 49832
## v94 49832
## v96 49832
## v99 49832
## v103 49832
## v104 49832
## v111 49832
## v126 49832
## v127 49832
## v2 49796
## v4 49796
## v17 49796
## v44 49796
## v48 49796
## v59 49796
## v61 49796
## v64 49796
## v76 49796
## v101 49796
## v106 49796
## v87 48663
## v105 48658
## v98 48654
## v70 48636
## v5 48624
## v36 48624
## v81 48624
## v82 48624
## v108 48624
## v109 48624
## v117 48624
## v128 48624
## v8 48619
## v25 48619
## v46 48619
## v54 48619
## v63 48619
## v89 48619
## v124 48619
## v21 611
## v34 111
## v40 111
## v12 86
## v50 86
## v10 84
## v114 30
## v14 4
## v38 0
## v62 0
## v72 0
## v129 0
#Target = 1
tmp <- train_num %>% filter(target==1) %>% select(-c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data target 1"))
##
## Variables sorted by number of missings:
## Variable Count
## v102 39472
## v85 39037
## v119 39035
## v51 39033
## v123 39033
## v23 39031
## v16 38424
## v69 38424
## v78 38424
## v115 38424
## v131 38424
## v9 38391
## v80 38391
## v122 38391
## v19 38384
## v37 38384
## v92 38384
## v95 38384
## v97 38384
## v118 38384
## v130 38384
## v20 38383
## v65 38383
## v121 38383
## v11 38379
## v15 38379
## v39 38379
## v43 38379
## v53 38379
## v58 38379
## v68 38379
## v73 38379
## v90 38379
## v100 38379
## v116 38379
## v120 38379
## v1 38376
## v6 38376
## v7 38376
## v13 38376
## v18 38376
## v26 38376
## v27 38376
## v28 38376
## v29 38376
## v32 38376
## v33 38376
## v35 38376
## v41 38376
## v42 38376
## v45 38376
## v49 38376
## v55 38376
## v57 38376
## v60 38376
## v67 38376
## v77 38376
## v83 38376
## v84 38376
## v86 38376
## v88 38376
## v93 38376
## v94 38376
## v96 38376
## v99 38376
## v103 38376
## v104 38376
## v111 38376
## v126 38376
## v127 38376
## v2 38350
## v4 38350
## v17 38350
## v44 38350
## v48 38350
## v59 38350
## v61 38350
## v64 38350
## v76 38350
## v101 38350
## v106 38350
## v87 37545
## v105 37540
## v98 37538
## v70 37525
## v5 37517
## v36 37517
## v81 37517
## v82 37517
## v108 37517
## v109 37517
## v117 37517
## v128 37517
## v8 37512
## v25 37512
## v46 37512
## v54 37512
## v63 37512
## v89 37512
## v124 37512
## v21 535
## v34 102
## v40 102
## v12 85
## v50 85
## v10 83
## v114 22
## v14 4
## v38 0
## v62 0
## v72 0
## v129 0
rm(tmp)
The distributions of the numeric features are mostly either normal or skewed. There are a few features with concave distributions and some features representing intervals (perhaps time). The density plots suggest some features that might be predictive: v10, v14, v2, v34, v36, v40, v50, v58, v100, v106, v114, v98, v119, v123, v129.
#Plot the density by target
plotDensity <- function(df){
  d <- gather(df, key = variable, value = value, -target, na.rm = TRUE) %>%
    mutate(target = as.character(target))
  ggplot(d, aes(x = value, ..scaled.., fill = target)) +
    geom_density(alpha = 0.4) +
    facet_wrap(~variable, scales = "free")
}
select(train_num, 2,3:22) %>% plotDensity()
select(train_num, 2,23:42) %>% plotDensity()
select(train_num, 2,43:62) %>% plotDensity()
select(train_num, 2,63:82) %>% plotDensity()
select(train_num, 2,83:102) %>% plotDensity()
select(train_num, 2,103:114) %>% plotDensity()
One of the categorical features (v22) has 18,210 levels, and two others have 90 and 122 levels. There are 8 categorical features with no missing values (v24, v47, v66, v71, v74, v75, v79, v110), two features (v30, v113) with almost 50% missing values, and the rest of the categorical features have less than 6% missing values.
From the visualisation, the proportion of missing values in v113 is greater in the target = 1 group. v3 and v31 have missing values on the same observations; they have the same number of categories but different values.
We looked at the combinations of missing values between v30, v113 and the numeric variables with a high proportion of missing values. 19% of observations had v30 missing together with the numeric variables, 17% had missing values in v113 only, 12% were missing in both v30 and v113 but not the numeric variables, 12% in both the categorical and numeric features, 8.3% in v30 only, and 6.5% in v113 and the numeric features. Missingness in either v30 or v113 does not indicate that the values of the numeric features are also missing.
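The combinations above can be tabulated directly. This sketch uses v1 as a stand-in representative of the high-missingness numeric block (an assumption on our part, since those features tend to be missing together) and counts each combination of missingness flags.

```r
# Sketch: cross-tabulate missingness of v30, v113 and the high-missingness
# numeric block, using v1 as a representative of that block (assumption).
train %>%
  transmute(v30_na  = is.na(v30),
            v113_na = is.na(v113),
            num_na  = is.na(v1)) %>%
  count(v30_na, v113_na, num_na) %>%
  mutate(pct = round(n / sum(n), 3)) %>%
  arrange(desc(n))
```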
sapply(train_fac,summary)
## $ID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 57280 114200 114200 171200 228700
##
## $target
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.7612 1.0000 1.0000
##
## $v3
## A B C NA's
## 227 53 110584 3457
##
## $v22
## AGDF YGJ QKI PWR HZE MNZ PTO ROZ YOD
## 2386 2119 668 649 423 382 350 344 303
## GBS VZF AJQ WNI HDD NRT PFR YEP ADDF
## 299 296 252 244 243 238 226 224 212
## VVI QVR WRI ABOF PSE NGS ACWE TVR PTJ
## 211 210 207 204 199 197 190 182 178
## ADMP UAG AGON AHE NWG ABF TG EJC AAPP
## 172 169 167 167 167 166 166 165 164
## OFD AMR AFYU ADMI AFOZ DJU PBC GEJ AXH
## 161 160 159 156 152 152 152 150 148
## QKP HUU MQE RIC JGY ACHJ KLZ GEB WFT
## 147 146 146 146 138 137 136 132 132
## ACXD ADGN AWT NXE AGZT BLE LIP LJS TTX
## 131 129 128 128 127 126 126 124 124
## ZLH CMG UBK AAGW AGMO QOE AJO EIA ONA
## 124 123 123 122 122 121 118 117 116
## YBV OPF GXR KOX MLO AFD DLW IAD EJZ
## 114 113 112 111 111 109 105 105 104
## OP BQS RXB SPT WPS ADLK SBE UUC AEXD
## 104 103 103 103 103 102 102 102 101
## AFUF ESJ GHP AHBA FVQ NTM HOF SEZ (Other)
## 101 101 101 100 100 100 99 99 93095
## NA's
## 500
##
## $v24
## A B C D E
## 3789 8150 20872 26333 55177
##
## $v30
## A B C D E F G NA's
## 2313 205 32178 5225 2973 2589 8728 60110
##
## $v31
## A B C NA's
## 88347 18947 3570 3457
##
## $v47
## A B C D E F G H I J
## 38 50 55425 3157 5301 4322 3946 1 39071 3010
##
## $v52
## A B C D E F G H I J K L
## 8925 9385 9681 9607 9282 9806 9419 8323 10260 11103 8949 9578
## NA's
## 3
##
## $v56
## BW DI AS BZ AW DO P DP CY
## 11351 10256 8832 7174 6369 5289 4997 4647 3830
## CN BL AF BJ AG DX DY BV U
## 3745 3570 2355 2345 2164 2095 1971 1899 1675
## DJ N DS V BX DH DF AL DR
## 1660 1638 1513 1477 1387 1298 1119 966 932
## BK BM G CS DA BQ CP R AZ
## 921 877 770 698 694 616 598 595 397
## BA CM AI Y AR CI AO Z A
## 394 393 323 314 260 222 211 199 186
## AN CC CF BH DL C DM BI DK
## 178 153 141 118 117 105 102 97 86
## DN DZ AK CO BS BG CH BY AA
## 81 80 69 66 63 62 62 61 60
## BU BF BD AY CA DG AH DE AU
## 57 54 41 34 31 26 25 18 17
## DU DV CB AC F CL CT CW DQ
## 15 15 13 11 11 9 9 9 9
## CX W BP BR CQ DW Q AV CV
## 7 7 6 6 6 6 6 5 5
## AT E B BE BN DT AB AE (Other)
## 4 4 3 3 3 3 2 2 34
## NA's
## 6882
##
## $v66
## A B C
## 70353 18264 25704
##
## $v71
## A B C D F G I K L
## 1 30255 8947 1 75094 5 16 1 1
##
## $v74
## A B C
## 45 113560 716
##
## $v75
## A B C D
## 18 39192 24 75087
##
## $v79
## A B C D E F G H I J K L
## 417 25801 34561 5302 25257 571 6 2004 4561 933 4308 1
## M N O P Q R
## 3946 49 3331 2217 1006 50
##
## $v91
## A B C D E F G NA's
## 27079 22683 23157 230 3206 13418 24545 3
##
## $v107
## A B C D E F G NA's
## 13418 22683 24545 23157 27079 3206 230 3
##
## $v110
## A B C
## 55688 55426 3207
##
## $v112
## A B C D E F G H I J K L
## 9545 2688 2055 7327 4748 21671 1702 5651 10224 3967 3252 4479
## M N O P Q R S T U V NA's
## 1193 9086 3661 4675 2346 4170 833 3980 4803 1883 382
##
## $v113
## A AA AB AC AD AE AF AG AH AI AJ AK
## 334 1 1273 5956 265 680 3568 1712 797 313 872 1
## B C D E F G H I J L M N
## 1358 707 142 324 233 16252 41 2605 322 477 7374 1186
## O P Q R S T U V W X Y Z
## 100 1975 594 181 614 1608 612 1673 1497 1635 1282 453
## NA's
## 55304
##
## $v125
## A AA AB AC AD AE AF AG AH AI AJ AK AL AM AN
## 1528 542 189 1945 486 833 1116 931 343 905 82 5337 1031 444 1539
## AO AP AQ AR AS AT AU AV AW AX AY AZ B BA BB
## 644 3410 469 2229 427 1114 1110 538 766 13 1042 2416 2374 349 68
## BC BD BE BF BG BH BI BJ BK BL BM BN BO BP BQ
## 394 2452 864 511 355 1145 720 4465 1477 1214 5759 910 970 504 687
## BR BS BT BU BV BW BX BY BZ C CA CB CC CD CE
## 642 240 407 1369 806 2478 1297 3311 3 514 1523 673 1082 2059 1164
## CF CG CH CI CJ CK CL D E F G H I J K
## 927 3826 701 397 1446 623 461 969 2521 221 2594 3212 253 907 2835
## L M N O P Q R S T U V W X Y Z
## 2502 636 1033 248 1463 1066 1217 809 615 1083 3234 888 867 880 1595
## NA's
## 77
#Get the number of levels for each categorical feature
tmp <- train_fac %>% select(-c(ID, target))
level<- sapply(tmp,function(x)nlevels(x))
level
## v3 v22 v24 v30 v31 v47 v52 v56 v66 v71 v74 v75
## 3 18210 5 7 3 10 12 122 3 9 3 4
## v79 v91 v107 v110 v112 v113 v125
## 18 7 7 3 22 36 90
#Overall missingness
missing_fac <- getMissing(tmp)
table(missing_fac$n_Miss)
##
## 0 3 77 382 500 3457 6882 55304 60110
## 8 3 1 1 1 2 1 1 1
#Missingness for target = 1
missing_tgt1 <- getMissing(select(filter(train_fac, target == 1), -ID))
names(missing_tgt1) <- c("feature.names", "n_Miss_tgt1", "pct_Miss_tgt1")
#Join the tables
missing_fac <- missing_fac %>%
inner_join(missing_tgt1)
## Warning in inner_join_impl(x, y, by$x, by$y): joining factors with
## different levels, coercing to character vector
slice(missing_fac, 1:10)
## feature.names n_Miss pct_Miss n_Miss_tgt1 pct_Miss_tgt1
## 1 v30 60110 0.5258001592 45881 5.272406e-01
## 2 v113 55304 0.4837606389 45564 5.235978e-01
## 3 v56 6882 0.0601989136 4284 4.922950e-02
## 4 v3 3457 0.0302394136 2780 3.194631e-02
## 5 v31 3457 0.0302394136 2780 3.194631e-02
## 6 v22 500 0.0043736496 410 4.711506e-03
## 7 v112 382 0.0033414683 299 3.435952e-03
## 8 v125 77 0.0006735420 64 7.354547e-04
## 9 v52 3 0.0000262419 3 3.447444e-05
## 10 v91 3 0.0000262419 3 3.447444e-05
#visualise the missing data
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data"))
##
## Variables sorted by number of missings:
## Variable Count
## v30 60110
## v113 55304
## v56 6882
## v3 3457
## v31 3457
## v22 500
## v112 382
## v125 77
## v52 3
## v91 3
## v107 3
## v24 0
## v47 0
## v66 0
## v71 0
## v74 0
## v75 0
## v79 0
## v110 0
#Target = 1
tmp <- train_fac %>% filter(target==1) %>% select(-c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data target=1","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## v30 45881
## v113 45564
## v56 4284
## v3 2780
## v31 2780
## v22 410
## v112 299
## v125 64
## v52 3
## v91 3
## v107 3
## v24 0
## v47 0
## v66 0
## v71 0
## v74 0
## v75 0
## v79 0
## v110 0
rm(missing_tgt1, tmp)
We cross-tabulated v30 and v113 against the missingness of two numeric features, v102 and v8, to see whether a specific categorical value is coded when the numeric values are missing. There does not appear to be one, as every category contains both missing and non-missing numeric values.
#Visualise v30 and v113 with the numeric variables
feature.names <- t(select(filter(missing_num, pct_Miss > 0.4),feature.names))
feature.names <- c("v30", "v113", feature.names)
tmp <- train[feature.names]
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## v30 60110
## v113 55304
## v102 51316
## v85 50682
## v119 50680
## v51 50678
## v123 50678
## v23 50675
## v16 49895
## v69 49895
## v78 49895
## v115 49895
## v131 49895
## v9 49851
## v80 49851
## v122 49851
## v19 49843
## v37 49843
## v92 49843
## v95 49843
## v97 49843
## v118 49843
## v130 49843
## v20 49840
## v65 49840
## v121 49840
## v11 49836
## v15 49836
## v39 49836
## v43 49836
## v53 49836
## v58 49836
## v68 49836
## v73 49836
## v90 49836
## v100 49836
## v116 49836
## v120 49836
## v1 49832
## v6 49832
## v7 49832
## v13 49832
## v18 49832
## v26 49832
## v27 49832
## v28 49832
## v29 49832
## v32 49832
## v33 49832
## v35 49832
## v41 49832
## v42 49832
## v45 49832
## v49 49832
## v55 49832
## v57 49832
## v60 49832
## v67 49832
## v77 49832
## v83 49832
## v84 49832
## v86 49832
## v88 49832
## v93 49832
## v94 49832
## v96 49832
## v99 49832
## v103 49832
## v104 49832
## v111 49832
## v126 49832
## v127 49832
## v2 49796
## v4 49796
## v17 49796
## v44 49796
## v48 49796
## v59 49796
## v61 49796
## v64 49796
## v76 49796
## v101 49796
## v106 49796
## v87 48663
## v105 48658
## v98 48654
## v70 48636
## v5 48624
## v36 48624
## v81 48624
## v82 48624
## v108 48624
## v109 48624
## v117 48624
## v128 48624
## v8 48619
## v25 48619
## v46 48619
## v54 48619
## v63 48619
## v89 48619
## v124 48619
#md <- md.pattern(tmp)
#md <- data.frame(obs=as.numeric(rownames(md)),md) %>%
# arrange(-obs)
#filter(md, obs > 1000 | is.na(obs))
#md.pairs(tmp)$mm
#Proportion of missingness by categorical value
feature.names <- c("v30","v113","v102","v8")
tmp <- train[feature.names]
tmp <- tmp %>% mutate(v30=ifelse(is.na(v30),'99',v30),
v113=ifelse(is.na(v113),'99',v113),
miss102=ifelse(is.na(v102),1,0),
miss8=ifelse(is.na(v8),1,0))
table(tmp$v30,tmp$miss102)
##
## 0 1
## 1 1865 448
## 2 143 62
## 3 22667 9511
## 4 4052 1173
## 5 2558 415
## 6 1912 677
## 7 6311 2417
## 99 23497 36613
table(tmp$v30,tmp$miss8)
##
## 0 1
## 1 1908 405
## 2 147 58
## 3 23443 8735
## 4 4217 1008
## 5 2650 323
## 6 1983 606
## 7 6609 2119
## 99 24745 35365
table(tmp$v113,tmp$miss102)
##
## 0 1
## 1 168 166
## 10 134 179
## 11 447 425
## 12 0 1
## 13 683 675
## 14 377 330
## 15 44 98
## 16 154 170
## 17 113 120
## 18 8764 7488
## 19 22 19
## 2 0 1
## 20 1444 1161
## 21 175 147
## 22 260 217
## 23 2190 5184
## 24 928 258
## 25 61 39
## 26 1116 859
## 27 320 274
## 28 99 82
## 29 331 283
## 3 639 634
## 30 432 1176
## 31 347 265
## 32 858 815
## 33 829 668
## 34 970 665
## 35 644 638
## 36 222 231
## 4 3135 2821
## 5 157 108
## 6 274 406
## 7 2038 1530
## 8 908 804
## 9 403 394
## 99 33319 21985
table(tmp$v113,tmp$miss8)
##
## 0 1
## 1 173 161
## 10 148 165
## 11 472 400
## 12 0 1
## 13 713 645
## 14 401 306
## 15 47 95
## 16 158 166
## 17 116 117
## 18 9257 6995
## 19 23 18
## 2 0 1
## 20 1519 1086
## 21 191 131
## 22 269 208
## 23 2257 5117
## 24 970 216
## 25 63 37
## 26 1147 828
## 27 335 259
## 28 101 80
## 29 355 259
## 3 680 593
## 30 458 1150
## 31 359 253
## 32 919 754
## 33 848 649
## 34 1010 625
## 35 681 601
## 36 225 228
## 4 3242 2714
## 5 159 106
## 6 283 397
## 7 2156 1412
## 8 969 743
## 9 421 376
## 99 34577 20727
#prop.table(table(tmp$v30,tmp$v113))
rm(tmp, feature.names)
Most of the categorical features, except perhaps v52 and v107, may be useful for prediction. We also note that the proportion of target = 1 is larger in the missing-value category of v30 and v113 than in the non-missing categories, so the missingness of a categorical feature is itself predictive.
For categorical features, our strategy is therefore to treat missingness as its own category, replacing missing values with a new categorical value.
#Convert target to factor and reverse order
train_fac$target <- factor(train_fac$target, levels = rev(levels(factor(train_fac$target))))
lvls <- names(level[level <= 50])
out <- NULL
for(i in 1:length(lvls)){
df <- train_fac[c("target",lvls[i])]
out[[i]] <- ggplot(df, aes_string(colnames(df)[2],
fill = colnames(df)[1])) + geom_bar()
}
grid.arrange(out[[1]], out[[2]], out[[3]], ncol = 3)
grid.arrange(out[[4]], out[[5]], out[[6]], ncol = 3)
grid.arrange(out[[7]], out[[8]], out[[9]], ncol = 3)
grid.arrange(out[[10]], out[[11]], out[[12]], ncol = 3)
grid.arrange(out[[13]], out[[14]], out[[15]], ncol = 3)
grid.arrange(out[[16]], ncol = 1)
rm(lvls,out,df,i)
train %>% select(target, v30) %>%
mutate(miss=ifelse(is.na(v30),1,0)) %>%
group_by(miss) %>%
summarize(mean(target))
## Source: local data frame [2 x 2]
##
## miss mean(target)
## (dbl) (dbl)
## 1 0 0.7588866
## 2 1 0.7632840
train %>% select(target, v113) %>%
mutate(miss=ifelse(is.na(v113),1,0)) %>%
group_by(miss) %>%
summarize(mean(target))
## Source: local data frame [2 x 2]
##
## miss mean(target)
## (dbl) (dbl)
## 1 0 0.7024586
## 2 1 0.8238825
We impute missing values for features with less than 5% missing, and recode categorical values with low counts by merging them into a larger category. The steps are:
v14 - replace negative values with 0.
Convert categorical values to integers, and replace missing values with -1.
Impute missing values for numeric features with less than 5% missing values.
Add an indicator (vComplete) for complete cases (observations without missing values) and an indicator for each feature flagging whether its value is missing. Replace remaining missing values with -1.
Recode categorical values with low counts.
#v14 - change values < 0 to 0
train_Imp <- mutate(train, v14=ifelse(v14<0,0,v14)) %>% select(-ID)
#Replace categorical features with integers and missing as -1
fac <- sapply(train_Imp, is.factor)
train_Imp[fac] <- lapply(train_Imp[fac],function(x){ifelse(is.na(x),as.integer(-1),as.integer(x))})
#Impute missing values for numeric features v10,v12,v14,v21,v34,v40, v50 and v114
#Use information for the categorical features as well
feature.names <- names(select(train_fac,-ID))
feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
summary(train_Imp[feature.names])
## target v3 v22 v24
## Min. :0.0000 Min. :-1.000 Min. : -1 Min. :1.000
## 1st Qu.:1.0000 1st Qu.: 3.000 1st Qu.: 4363 1st Qu.:3.000
## Median :1.0000 Median : 3.000 Median : 9340 Median :4.000
## Mean :0.7612 Mean : 2.875 Mean : 9177 Mean :4.058
## 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:13638 3rd Qu.:5.000
## Max. :1.0000 Max. : 3.000 Max. :18210 Max. :5.000
##
## v30 v31 v47 v52
## Min. :-1.000 Min. :-1.000 Min. : 1.000 Min. :-1.000
## 1st Qu.:-1.000 1st Qu.: 1.000 1st Qu.: 3.000 1st Qu.: 4.000
## Median :-1.000 Median : 1.000 Median : 4.000 Median : 7.000
## Mean : 1.326 Mean : 1.168 Mean : 5.606 Mean : 6.558
## 3rd Qu.: 3.000 3rd Qu.: 1.000 3rd Qu.: 9.000 3rd Qu.:10.000
## Max. : 7.000 Max. : 3.000 Max. :10.000 Max. :12.000
##
## v56 v66 v71 v74
## Min. : -1.00 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.: 22.00 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:2.000
## Median : 51.00 Median :1.000 Median :5.00 Median :2.000
## Mean : 58.59 Mean :1.609 Mean :4.05 Mean :2.006
## 3rd Qu.: 88.00 3rd Qu.:2.000 3rd Qu.:5.00 3rd Qu.:2.000
## Max. :122.00 Max. :3.000 Max. :9.00 Max. :3.000
##
## v75 v79 v91 v107
## Min. :1.000 Min. : 1.000 Min. :-1.000 Min. :-1.000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median :4.000 Median : 3.000 Median : 3.000 Median : 3.000
## Mean :3.314 Mean : 5.038 Mean : 3.597 Mean : 3.335
## 3rd Qu.:4.000 3rd Qu.: 5.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :4.000 Max. :18.000 Max. : 7.000 Max. : 7.000
##
## v110 v112 v113 v125
## Min. :1.000 Min. :-1.000 Min. :-1.000 Min. :-1.00
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.:-1.000 1st Qu.:26.00
## Median :2.000 Median : 9.000 Median : 4.000 Median :46.00
## Mean :1.541 Mean : 9.685 Mean : 8.886 Mean :46.72
## 3rd Qu.:2.000 3rd Qu.:14.000 3rd Qu.:18.000 3rd Qu.:69.00
## Max. :3.000 Max. :22.000 Max. :36.000 Max. :90.00
##
## v10 v12 v14 v21
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.1062
## 1st Qu.: 1.050 1st Qu.: 6.322 1st Qu.:11.26 1st Qu.: 6.4155
## Median : 1.313 Median : 6.613 Median :11.97 Median : 7.0454
## Mean : 1.883 Mean : 6.881 Mean :12.09 Mean : 7.0297
## 3rd Qu.: 2.101 3rd Qu.: 7.020 3rd Qu.:12.72 3rd Qu.: 7.6706
## Max. :18.534 Max. :18.711 Max. :20.00 Max. :19.2961
## NA's :84 NA's :86 NA's :4 NA's :611
## v34 v40 v50 v114
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 5.054 1st Qu.: 8.408 1st Qu.: 0.6588 1st Qu.:12.00
## Median : 6.537 Median :10.334 Median : 1.2119 Median :14.04
## Mean : 6.406 Mean :10.466 Mean : 1.5043 Mean :13.57
## 3rd Qu.: 7.703 3rd Qu.:12.765 3rd Qu.: 2.0072 3rd Qu.:15.37
## Max. :20.000 Max. :20.000 Max. :20.0000 Max. :20.00
## NA's :111 NA's :111 NA's :86 NA's :30
set.seed(12345)
imputed = complete(mice(train_Imp[feature.names]))
##
## iter imp variable
## 1 1 v10 v12 v14 v21 v34 v40 v50 v114
## 1 2 v10 v12 v14 v21 v34 v40 v50 v114
## 1 3 v10 v12 v14 v21 v34 v40 v50 v114
## 1 4 v10 v12 v14 v21 v34 v40 v50 v114
## 1 5 v10 v12 v14 v21 v34 v40 v50 v114
## 2 1 v10 v12 v14 v21 v34 v40 v50 v114
## 2 2 v10 v12 v14 v21 v34 v40 v50 v114
## 2 3 v10 v12 v14 v21 v34 v40 v50 v114
## 2 4 v10 v12 v14 v21 v34 v40 v50 v114
## 2 5 v10 v12 v14 v21 v34 v40 v50 v114
## 3 1 v10 v12 v14 v21 v34 v40 v50 v114
## 3 2 v10 v12 v14 v21 v34 v40 v50 v114
## 3 3 v10 v12 v14 v21 v34 v40 v50 v114
## 3 4 v10 v12 v14 v21 v34 v40 v50 v114
## 3 5 v10 v12 v14 v21 v34 v40 v50 v114
## 4 1 v10 v12 v14 v21 v34 v40 v50 v114
## 4 2 v10 v12 v14 v21 v34 v40 v50 v114
## 4 3 v10 v12 v14 v21 v34 v40 v50 v114
## 4 4 v10 v12 v14 v21 v34 v40 v50 v114
## 4 5 v10 v12 v14 v21 v34 v40 v50 v114
## 5 1 v10 v12 v14 v21 v34 v40 v50 v114
## 5 2 v10 v12 v14 v21 v34 v40 v50 v114
## 5 3 v10 v12 v14 v21 v34 v40 v50 v114
## 5 4 v10 v12 v14 v21 v34 v40 v50 v114
## 5 5 v10 v12 v14 v21 v34 v40 v50 v114
train_Imp[feature.names] = imputed[feature.names]
#check that they are not missing anymore
summary(train_Imp[feature.names])
## target v3 v22 v24
## Min. :0.0000 Min. :-1.000 Min. : -1 Min. :1.000
## 1st Qu.:1.0000 1st Qu.: 3.000 1st Qu.: 4363 1st Qu.:3.000
## Median :1.0000 Median : 3.000 Median : 9340 Median :4.000
## Mean :0.7612 Mean : 2.875 Mean : 9177 Mean :4.058
## 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:13638 3rd Qu.:5.000
## Max. :1.0000 Max. : 3.000 Max. :18210 Max. :5.000
## v30 v31 v47 v52
## Min. :-1.000 Min. :-1.000 Min. : 1.000 Min. :-1.000
## 1st Qu.:-1.000 1st Qu.: 1.000 1st Qu.: 3.000 1st Qu.: 4.000
## Median :-1.000 Median : 1.000 Median : 4.000 Median : 7.000
## Mean : 1.326 Mean : 1.168 Mean : 5.606 Mean : 6.558
## 3rd Qu.: 3.000 3rd Qu.: 1.000 3rd Qu.: 9.000 3rd Qu.:10.000
## Max. : 7.000 Max. : 3.000 Max. :10.000 Max. :12.000
## v56 v66 v71 v74
## Min. : -1.00 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.: 22.00 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:2.000
## Median : 51.00 Median :1.000 Median :5.00 Median :2.000
## Mean : 58.59 Mean :1.609 Mean :4.05 Mean :2.006
## 3rd Qu.: 88.00 3rd Qu.:2.000 3rd Qu.:5.00 3rd Qu.:2.000
## Max. :122.00 Max. :3.000 Max. :9.00 Max. :3.000
## v75 v79 v91 v107
## Min. :1.000 Min. : 1.000 Min. :-1.000 Min. :-1.000
## 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median :4.000 Median : 3.000 Median : 3.000 Median : 3.000
## Mean :3.314 Mean : 5.038 Mean : 3.597 Mean : 3.335
## 3rd Qu.:4.000 3rd Qu.: 5.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :4.000 Max. :18.000 Max. : 7.000 Max. : 7.000
## v110 v112 v113 v125
## Min. :1.000 Min. :-1.000 Min. :-1.000 Min. :-1.00
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.:-1.000 1st Qu.:26.00
## Median :2.000 Median : 9.000 Median : 4.000 Median :46.00
## Mean :1.541 Mean : 9.685 Mean : 8.886 Mean :46.72
## 3rd Qu.:2.000 3rd Qu.:14.000 3rd Qu.:18.000 3rd Qu.:69.00
## Max. :3.000 Max. :22.000 Max. :36.000 Max. :90.00
## v10 v12 v14 v21
## Min. :-0.000001 Min. : 0.000001 Min. : 0.00 Min. : 0.1062
## 1st Qu.: 1.050328 1st Qu.: 6.322328 1st Qu.:11.26 1st Qu.: 6.4165
## Median : 1.312910 Median : 6.612881 Median :11.97 Median : 7.0481
## Mean : 1.882962 Mean : 6.881225 Mean :12.09 Mean : 7.0345
## 3rd Qu.: 2.100657 3rd Qu.: 7.019932 3rd Qu.:12.72 3rd Qu.: 7.6764
## Max. :18.533920 Max. :18.710550 Max. :20.00 Max. :19.2961
## v34 v40 v50 v114
## Min. :-0.000001 Min. : 0.000 Min. :-0.000001 Min. : 0.00
## 1st Qu.: 5.054449 1st Qu.: 8.408 1st Qu.: 0.658792 1st Qu.:12.00
## Median : 6.537180 Median :10.334 Median : 1.211944 Median :14.04
## Mean : 6.406581 Mean :10.465 Mean : 1.504475 Mean :13.57
## 3rd Qu.: 7.702727 3rd Qu.:12.764 3rd Qu.: 2.007189 3rd Qu.:15.37
## Max. :20.000000 Max. :20.000 Max. :20.000000 Max. :20.00
#Add an indicator (vComplete) for complete cases, and an indicator per feature flagging missing values
feature.names <- names(select(train_num,-ID))
tmp <- data.frame(vComplete = as.integer(complete.cases(train_Imp)))
train_Imp <- bind_cols(train_Imp, tmp)
feature.names <- names(select(train_num,-c(ID,target,v10,v12,v14,v21,v34,v40,v50,v114)))
for(i in 1:length(feature.names)) {
train_Imp[[paste(feature.names[i],"Miss", sep="_")]] <- apply(train_Imp[feature.names[i]],2,function(x){ifelse(is.na(x),1,0)})
}
#Recode NA to -1
train_Imp[feature.names] <- lapply(train_Imp[feature.names], function(x){ifelse(is.na(x),-1,x)})
#Recode categories with low counts
train_Imp$v47 <- ifelse(train_Imp$v47 %in% c(8), as.integer(9), train_Imp$v47)
train_Imp$v52 <- ifelse(train_Imp$v52 %in% c(-1), as.integer(10), train_Imp$v52)
train_Imp$v71 <- ifelse(train_Imp$v71 %in% c(1,4,6,7,8,9), as.integer(5), train_Imp$v71)
train_Imp$v79 <- ifelse(train_Imp$v79 %in% c(1,6,7,8,10,12,14,17,18), as.integer(19), train_Imp$v79)
train_Imp$v91 <- ifelse(train_Imp$v91 %in% c(-1), as.integer(1), train_Imp$v91)
train_Imp$v113 <- ifelse(train_Imp$v113 %in% c(2,12,19), as.integer(18), train_Imp$v113)
rm(imputed, fac, tmp, feature.names,i)
We looked at the pairwise associations between the categorical features and the correlations between the numeric features to see whether we can reduce the feature set. We want features that are independent of each other but on which the target variable depends.
We measure the strength of association between categorical features using the Cramér's V statistic from the vcd package. Several pairs of features are strongly associated (> 0.7): the association is perfect (or nearly so) between v91 and v107 and between v47 and v110, and very strong between v79 and v110. Cross-tabulations show that one feature in each pair carries the same information, so we drop v110 and v107 as redundant.
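For reference, for an \(r \times c\) contingency table built from \(n\) observations, Cramér's V (as computed by assocstats) is derived from the \(\chi^2\) statistic:
\[V = \sqrt{\frac{\chi^{2}}{n \cdot min(r-1,\, c-1)}}\]
V ranges from 0 (no association) to 1 (perfect association), so a value near 1, as for v91 and v107, means one feature effectively determines the other.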
out = data.frame(var1=character(),
var2=character(),
cramerV=double())
#Exclude features with high number of factors
feature.names <- names(select(train_fac,-c(ID,v22,v56,v125)))
tmp <- train_Imp[feature.names]
for(i in 1:ncol(tmp)){
for(j in 1:ncol(tmp))
if (i != j){
tbl <- select(tmp,c(i,j))
xtab <- xtabs(~., tbl)
newrow = data.frame(var1=names(tmp[,i]),
var2=names(tmp[,j]),
cramerV=assocstats(xtab)$cramer)
out <- rbind(out, newrow)
}
}
out %>% filter(cramerV > 0.7) %>% arrange(-cramerV)
## var1 var2 cramerV
## 1 v91 v107 1.0000000
## 2 v107 v91 1.0000000
## 3 v47 v110 0.9999910
## 4 v110 v47 0.9999910
## 5 v79 v110 0.9261114
## 6 v110 v79 0.9261114
## 7 v47 v79 0.8175226
## 8 v79 v47 0.8175226
## 9 v31 v110 0.7689007
## 10 v110 v31 0.7689007
## 11 v113 v110 0.7515573
## 12 v110 v113 0.7515573
## 13 v71 v75 0.7069755
## 14 v75 v71 0.7069755
table(tmp$v47,tmp$v110) #Drop v110 - carries the same information (e.g. level 2 of v110 corresponds to level 3 of v47)
##
## 1 2 3
## 1 38 0 0
## 2 0 0 50
## 3 0 55425 0
## 4 0 0 3157
## 5 5301 0 0
## 6 4322 0 0
## 7 3946 0 0
## 9 39071 1 0
## 10 3010 0 0
table(tmp$v79,tmp$v110) #Drop v110 - gives the same information
##
## 1 2 3
## 2 0 25801 0
## 3 34561 0 0
## 4 5302 0 0
## 5 0 25257 0
## 9 4561 0 0
## 11 4308 0 0
## 13 3946 0 0
## 15 0 3331 0
## 16 0 0 2217
## 19 3010 1037 990
table(tmp$v91,tmp$v107) #Drop v107
##
## -1 1 2 3 4 5 6 7
## 1 3 0 0 0 0 27079 0 0
## 2 0 0 22683 0 0 0 0 0
## 3 0 0 0 0 23157 0 0 0
## 4 0 0 0 0 0 0 0 230
## 5 0 0 0 0 0 0 3206 0
## 6 0 13418 0 0 0 0 0 0
## 7 0 0 0 24545 0 0 0 0
rm(feature.names, tmp, out)
We use the corrplot package to visualise the correlation matrix for the numeric features. We first looked at the correlations between the non-missing numeric features using all the data. The correlogram shows that some of the features are highly correlated; there are 5 pairs of features with |correlation| > 0.7.
#Correlation between non-missing numeric features
feature.names <- c("v10","v12","v14","v21","v34","v40","v50","v114")
tmp <- train_Imp[feature.names]
cor <- cor(tmp, use="everything")
col <- colorRampPalette(c("red", "white", "blue"))(10)
corrplot(cor, type="lower", order="hclust", col=col, tl.cex=1, tl.col="black", tl.srt=0)
cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor))
cor <- cor %>%
filter(!is.na(Freq)) %>%
arrange(Var1, Var2)
filter(cor, abs(Freq) > 0.7)
## Var1 Var2 Freq
## 1 v10 v12 0.9118438
## 2 v14 v21 0.8113106
## 3 v34 v40 -0.9030706
## 4 v34 v114 0.9116072
## 5 v40 v114 -0.9662656
A plot of a random sample of the data shows some interesting patterns and relationships in the scatter plots: v10 appears to take values in discrete intervals, and there are unusual patterns between v34 and v40, v12 and v50, and v40 and v114.
#take random subset of 5000 observations
tmp <- train_Imp[c("target",feature.names)] %>%
mutate(target=factor(target)) %>%
sample_n(5000)
#Plot all pairwise features
out <- NULL
x <- as.character(cor$Var1)
y <- as.character(cor$Var2)
for(i in 1:nrow(cor)){
df <- tmp[c("target",x[i],y[i])]
out[[i]] <- ggplot(df, aes_string(x=colnames(df)[2],
y=colnames(df)[3],
fill="target")) +
geom_point(cex=3, pch=21)
}
grid.arrange(out[[1]], out[[2]], out[[3]], out[[4]], ncol = 4)
grid.arrange(out[[5]], out[[6]], out[[7]], out[[8]], ncol = 4)
grid.arrange(out[[9]], out[[10]], out[[11]], out[[12]], ncol = 4)
grid.arrange(out[[13]], out[[14]], out[[15]], out[[16]], ncol = 4)
grid.arrange(out[[17]], out[[18]], out[[19]], out[[20]], ncol = 4)
grid.arrange(out[[21]], out[[22]], out[[23]], out[[24]], ncol = 4)
grid.arrange(out[[25]], out[[26]], out[[27]], out[[28]], ncol = 4)
rm(feature.names,tmp,cor,col,out,x,y,df)
We then looked at the correlations among the complete cases. The correlogram again shows some highly correlated features, as well as clusters of features that are positively correlated with one another but negatively correlated with another group of features.
There are 61 pairs of features with |correlation| greater than 0.9. Some features appear numerous times (v25 five times; v8, v29, v33 and v46 four times each), which suggests that they are linked.
For predicting the target, we want numeric features that are correlated with the target but not with each other.
# ALL numeric variables
feature.names <- names(select(train_num,-c(ID,target)))
tmp <- filter(train_Imp,vComplete == 1)[feature.names]
cor <- cor(tmp, use="everything")
col<- colorRampPalette(c("red", "white", "blue"))(10)
corrplot(cor, type="lower", order="hclust", method="color", col=col, tl.cex=0.4, tl.col="black", tl.srt=0)
cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor))
cor <- cor %>%
filter(!is.na(Freq)) %>%
arrange(Freq)
#correlation greater than 0.9
high_cor <- filter(cor, abs(Freq) > 0.9)
slice(high_cor, 1:10)
## Var1 Var2 Freq
## 1 v58 v100 -0.9976676
## 2 v69 v115 -0.9936711
## 3 v48 v106 -0.9826985
## 4 v40 v114 -0.9671537
## 5 v48 v64 -0.9648004
## 6 v39 v68 -0.9473669
## 7 v17 v48 -0.9129205
## 8 v5 v81 -0.9053241
## 9 v55 v83 0.9004106
## 10 v83 v111 0.9019186
#Count number of times a feature is correlated with another
mutate(high_cor, Freq=1) %>% group_by(Var1) %>% summarise(count=sum(Freq)) %>% filter(count > 1) %>% arrange(-count)
## Source: local data frame [15 x 2]
##
## Var1 count
## (fctr) (dbl)
## 1 v25 5
## 2 v8 4
## 3 v29 4
## 4 v33 4
## 5 v46 4
## 6 v17 3
## 7 v41 3
## 8 v54 3
## 9 v15 2
## 10 v26 2
## 11 v32 2
## 12 v48 2
## 13 v63 2
## 14 v64 2
## 15 v83 2
#take random subset of data
tmp <- train_Imp[c("target","vComplete",feature.names)] %>%
filter(vComplete==1) %>%
mutate(target=factor(target)) %>%
sample_n(5000)
#Plot random sample of 28 pairs
cor <- cor %>% sample_n(28)
out <- NULL
x <- as.character(cor$Var1)
y <- as.character(cor$Var2)
for(i in 1:nrow(cor)){
df <- tmp[c("target",x[i],y[i])]
out[[i]] <- ggplot(df, aes_string(x=colnames(df)[2],
y=colnames(df)[3],
fill="target")) +
geom_point(cex=3, pch=21)
}
grid.arrange(out[[1]], out[[2]], out[[3]], out[[4]], ncol = 4)
grid.arrange(out[[5]], out[[6]], out[[7]], out[[8]], ncol = 4)
grid.arrange(out[[9]], out[[10]], out[[11]], out[[12]], ncol = 4)
grid.arrange(out[[13]], out[[14]], out[[15]], out[[16]], ncol = 4)
grid.arrange(out[[17]], out[[18]], out[[19]], out[[20]], ncol = 4)
grid.arrange(out[[21]], out[[22]], out[[23]], out[[24]], ncol = 4)
grid.arrange(out[[25]], out[[26]], out[[27]], out[[28]], ncol = 4)
#Correlations between Target and numeric variables
feature.names <- names(select(train_num,-c(ID)))
tmp <- filter(train_Imp,vComplete == 1)[feature.names]
cor <- cor(tmp, use="everything")
cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor))
#Retain correlations with the target above 0.06 in absolute value
cor <- cor %>%
filter(!is.na(Freq), Var1=="target", abs(Freq) > 0.06) %>% arrange(-abs(Freq))
slice(cor, 1:10)
## Var1 Var2 Freq
## 1 target v50 0.23244609
## 2 target v10 0.14635601
## 3 target v129 0.14476307
## 4 target v14 0.14422416
## 5 target v34 0.09390153
## 6 target v62 -0.09282448
## 7 target v21 0.08204274
## 8 target v72 0.07654414
## 9 target v114 0.06824029
## 10 target v38 0.06265813
numeric.features <- as.character(t(cor$Var2))
rm(feature.names,tmp,cor,col,high_cor,out,x,y,df)
We split the training data provided into an 80% training set and a 20% test set to assess model accuracy.
1. Baseline Model
We know from the data that 76.1% of the claims were classified as suitable for accelerated approval, so our baseline model simply predicts the probability that target = 1 as the proportion of target = 1 in the training set. This baseline scores a log loss of 0.5444 on the test set.
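For a constant predicted probability \(p\), the log loss defined earlier reduces to a simple expression in the test-set proportion of accelerated claims, \(\bar{y}\):
\[logLoss = -(\bar{y}\,log\,p + (1-\bar{y})\,log(1-p))\]
Differentiating with respect to \(p\) shows this is minimised at \(p = \bar{y}\), so predicting the observed proportion of target = 1 is the best possible constant prediction and a natural benchmark.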
#Function to evaluate the log loss metric
logLoss <- function(true_target, pred_prob){
eps <- 1e-15
pred_prob[pred_prob < eps] <- eps
pred_prob[pred_prob > 1-eps] <- 1-eps
out <- -1/length(true_target) * sum(true_target*log(pred_prob) +
(1-true_target)*log(1-(pred_prob)))
names(out) <- "logLoss"
out
}
#Split into training and test set (80:20)
set.seed(3234) # for reproducibility
inTrain <- createDataPartition(y = train_Imp$target, p=0.8)
train_set <- slice(train_Imp, inTrain$Resample1)
test_set <- slice(train_Imp, -inTrain$Resample1)
#1. BaseLine prediction - Mean
mu <- mean(train_set$target)
pred <- rep(mu, nrow(test_set))
naive_loss<- logLoss(test_set$target, pred)
naive_loss
## logLoss
## 0.5444054
logLoss_results <- data_frame(Method = "Average", logLoss = naive_loss)
2. Logistic Regression, Model 1
We fit a logistic regression using the features with no remaining missing values: the categorical variables (excluding v22, v56, v125, v107 and v110) and the numeric features v10, v12, v14, v21, v34, v40, v50 and v114. This improves the log loss to 0.4872.
library("rpart")
## Warning: package 'rpart' was built under R version 3.2.5
library("rpart.plot")
## Warning: package 'rpart.plot' was built under R version 3.2.5
#Logistic Regression, Model 1
feature.names <- names(select(train_fac,-c(ID,target,v22,v56, v125,v107,v110)))
feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
train_glm <- train_set[c("target",feature.names)]
test_glm <- test_set[c("target",feature.names)]
glm_fit1 <- glm(target ~., data=train_glm, family="binomial")
pred1 = predict(glm_fit1, newdata = test_glm, type = "response")
glm_loss1 <- logLoss(test_glm$target, pred1)
glm_loss1
## logLoss
## 0.4872247
logLoss_results <- bind_rows(logLoss_results,
data_frame(Method="Logit Model 1",
logLoss = glm_loss1))
rm(train_glm, test_glm)
3. Logistic Regression, Model 2
We added numeric features we thought might be predictive to the logistic model, together with the dummy variables flagging missing values. This improves the log loss slightly to 0.4868.
feature.names <- names(select(train_fac,-c(ID,target,v22,v56, v125,v107,v110)))
feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
numeric.features <- c("v2","v4","v28","v36","v44","v54","v61","v63","v64","v81","v87","v100","v106","v98","v119","v123","v129")
dummy.features = NULL
for(i in 1:length(numeric.features)) {
dummy.features <- c(dummy.features, paste(numeric.features[i],"Miss", sep="_"))
}
feature.names <- unique(c(feature.names, numeric.features, dummy.features))
train_glm <- train_set[c("target",feature.names)]
test_glm <- test_set[c("target",feature.names)]
glm_fit <- glm(target ~., data=train_glm, family="binomial")
pred2 = predict(glm_fit, newdata = test_glm, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
glm_loss2 <- logLoss(test_glm$target, pred2)
glm_loss2
## logLoss
## 0.4868249
logLoss_results <- bind_rows(logLoss_results,
data_frame(Method="Logit Model 2",
logLoss = glm_loss2))
rm(train_glm, test_glm)
4. Tree-Based Models
We tried two tree-based models: a decision tree and a random forest.
For the classification tree, we used the rpart package, choosing only the features with no missing values, as in logistic Model 1. We pruned the tree to avoid over-fitting. The first split is on v50, which we saw earlier has the highest correlation with the target variable. The resulting log loss of 0.4975 is not as good as the logistic models'.
feature.names <- names(select(train_fac,-c(ID,target,v22,v56,v125,v107,v110)))
feature.names <- c(feature.names,"v10","v12","v14","v21","v34","v40","v50","v114")
train_cart <- train_set[c("target", feature.names)]
test_cart <- test_set[c("target", feature.names)]
#Convert categorical features into factor
feature.names <- names(select(train_fac,-c(ID, v22,v56, v125,v107,v110)))
train_cart[feature.names] <- lapply(train_cart[feature.names], as.factor)
test_cart[feature.names] <- lapply(test_cart[feature.names], as.factor)
#Fit the model
cart_fit = rpart(target ~ ., data = train_cart,method="class", control = rpart.control(minsplit=20, cp = 0.0005))
# plot the tree
printcp(cart_fit)
##
## Classification tree:
## rpart(formula = target ~ ., data = train_cart, method = "class",
## control = rpart.control(minsplit = 20, cp = 5e-04))
##
## Variables actually used in tree construction:
## [1] v10 v112 v113 v114 v12 v21 v24 v30 v31 v34 v40 v47 v50 v52
## [15] v66 v79 v91
##
## Root node error: 21945/91457 = 0.23995
##
## n= 91457
##
## CP nsplit rel error xerror xstd
## 1 0.01480975 0 1.00000 1.00000 0.0058851
## 2 0.00546822 3 0.95557 0.96031 0.0058031
## 3 0.00248348 6 0.93917 0.94094 0.0057616
## 4 0.00209615 8 0.93420 0.93862 0.0057566
## 5 0.00179236 9 0.93210 0.93739 0.0057539
## 6 0.00173160 12 0.92673 0.93128 0.0057406
## 7 0.00132149 16 0.91980 0.92882 0.0057352
## 8 0.00129870 17 0.91848 0.92987 0.0057375
## 9 0.00127592 19 0.91588 0.92951 0.0057367
## 10 0.00100251 20 0.91460 0.92832 0.0057341
## 11 0.00097972 22 0.91260 0.92709 0.0057313
## 12 0.00088859 26 0.90823 0.92700 0.0057311
## 13 0.00082023 28 0.90645 0.92673 0.0057305
## 14 0.00077466 31 0.90399 0.92677 0.0057306
## 15 0.00069872 33 0.90244 0.92618 0.0057293
## 16 0.00068353 36 0.90034 0.92622 0.0057294
## 17 0.00061517 40 0.89761 0.92750 0.0057322
## 18 0.00059239 43 0.89565 0.92627 0.0057295
## 19 0.00057720 44 0.89506 0.92714 0.0057314
## 20 0.00054682 47 0.89332 0.92736 0.0057319
## 21 0.00052404 48 0.89278 0.92736 0.0057319
## 22 0.00050125 50 0.89173 0.92691 0.0057309
## 23 0.00050000 51 0.89123 0.92718 0.0057315
plotcp(cart_fit)
par(mfrow=c(1,1), mar=c(1,1,1,1))
plot(cart_fit, uniform=T, compress=T, margin=0.1, branch=0.3)
text(cart_fit, use.n=T, digits=3, cex=0.6)
# prune the tree to the cp that minimises the error
pfit<- prune(cart_fit, cp= cart_fit$cptable[which.min(cart_fit$cptable[,"xerror"]),"CP"])
# plot the pruned tree
par(mfrow=c(1,1), mar=c(1,1,1,1))
plot(pfit, uniform=T, compress=T, margin=0.1, branch=0.3)
text(pfit, use.n=T, digits=3, cex=0.6)
pred3 = predict(pfit, newdata = test_cart, type = "prob")[,2]
cart_loss <- logLoss(as.integer(test_cart$target)-1, pred3)
cart_loss
## logLoss
## 0.4975313
logLoss_results <- bind_rows(logLoss_results,
data_frame(Method="Decision Tree",
logLoss = cart_loss))
rm(train_cart, test_cart)
library("randomForest")
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
feature.names <- names(select(train_fac,-c(ID,target,v22,v56,v125,v107,v110)))
feature.names <- c(feature.names,"v10","v12","v14","v21","v34","v40","v50","v114")
train_rf <- train_set[c("target",feature.names)]
test_rf <- test_set[c("target",feature.names)]
# Convert categorical features (and the target, needed for classification) into factors
feature.names <- names(select(train_fac, -c(ID, v22, v56, v125, v107, v110)))
train_rf[feature.names] <- lapply(train_rf[feature.names], as.factor)
test_rf[feature.names] <- lapply(test_rf[feature.names], as.factor)
#Fit the model
rf_fit = randomForest(target ~ ., data = train_rf, ntree=200, nodesize=250, mtry=9, importance=TRUE)
plot(rf_fit)
varImpPlot(rf_fit, sort=T, main="Variable Importance")
var.Imp <- data.frame(importance(rf_fit, type=2))
var.Imp$Variables <- row.names(var.Imp)
var.Imp[order(var.Imp$MeanDecreaseGini, decreasing=T),]
## MeanDecreaseGini Variables
## v50 2442.070134 v50
## v66 722.825683 v66
## v113 658.208236 v113
## v79 515.287728 v79
## v112 423.745541 v112
## v31 275.986120 v31
## v47 256.776633 v47
## v52 185.522127 v52
## v10 184.222232 v10
## v12 174.060250 v12
## v24 162.404674 v24
## v114 150.528028 v114
## v34 126.879858 v34
## v40 123.988029 v40
## v21 109.314066 v21
## v14 92.612291 v14
## v30 85.735695 v30
## v91 34.612551 v91
## v71 10.759146 v71
## v74 5.769893 v74
## v75 4.359533 v75
## v3 3.967014 v3
pred4 = predict(rf_fit, newdata = test_rf, type = "prob")[,2]
rf_loss <- logLoss(as.integer(test_rf$target)-1, pred4)
rf_loss
## logLoss
## 0.5000409
logLoss_results <- bind_rows(logLoss_results,
data_frame(Method="Random Forest",
logLoss = rf_loss))
logLoss_results
## Source: local data frame [5 x 2]
##
## Method logLoss
## (chr) (dbl)
## 1 Average 0.5444054
## 2 Logit Model 1 0.4872247
## 3 Logit Model 2 0.4868249
## 4 Decision Tree 0.4975313
## 5 Random Forest 0.5000409
The goal of the project was to apply machine learning models to predict the category of claims. We tried three models - logistic regression, decision tree and random forest - using various combinations of features. The best log loss score we achieved was around 0.49. Dexter's Lab, the team that won the Kaggle competition, achieved a private leaderboard score of 0.42, so we were a long way off!
The anonymized dataset made the project considerably more challenging. Not knowing what the features represent, combined with the large number of features with a high proportion of missing values, increased the complexity. We do not believe the missing values were random; more likely they depend on the type of claim (product type) and the information collected. Apart from imputing some features with a low proportion of missing values, we simply recoded the missing values to a new category/value.
Dexter’s Lab’s winning solution used an XGBoost model. XGBoost (extreme gradient boosting) is a tree-based gradient boosting model that is widely used in Kaggle competitions. Unfortunately we did not get to try it out in this project. Another key to winning the competition was feature engineering: by exploring patterns in the data, Dexter’s Lab worked out which features represented dates and constructed panel time-series data.
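Had we tried XGBoost, a minimal fit on the same data might have looked like the sketch below. This reuses `train_rf`/`test_rf` from above, and the parameter values (`nrounds`, `max_depth`, `eta`) are illustrative defaults, not the winning team's settings.

```r
# Hedged sketch (not run in this report): XGBoost on the claims data.
library(xgboost)

# xgboost needs a numeric matrix, so one-hot encode the factors
X_train <- model.matrix(target ~ . - 1, data = train_rf)
X_test  <- model.matrix(target ~ . - 1, data = test_rf)
y_train <- as.integer(train_rf$target) - 1

xgb_fit <- xgboost(data = X_train, label = y_train,
                   objective = "binary:logistic",
                   eval_metric = "logloss",
                   nrounds = 200, max_depth = 6, eta = 0.1,
                   verbose = 0)

pred_xgb <- predict(xgb_fit, X_test)  # predicted probabilities
```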
We spent a major part of the project visualising the data and worrying about missing values. In hindsight, a better approach would have been to fit a simple random forest model first and use variable importance to focus our visualisation on the important features.
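The triage step suggested above could be sketched as follows, reusing `train_rf` from earlier; the sample size and `ntree` value are arbitrary choices, since a small forest on a subset of rows is usually enough to rank features:

```r
# Hedged sketch: quick variable-importance triage before any deep EDA.
set.seed(1)
idx <- sample(nrow(train_rf), 10000)
quick_rf <- randomForest(target ~ ., data = train_rf[idx, ],
                         ntree = 50, importance = TRUE)

# Top 10 features by mean decrease in Gini impurity
imp <- importance(quick_rf, type = 2)
head(rownames(imp)[order(imp[, 1], decreasing = TRUE)], 10)
```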